IGPred-HDnet: Prediction of Immunoglobulin Proteins Using Graphical Features and the Hierarchal Deep Learning-Based Approach

Motivation. Immunoglobulin proteins (IGPs), also called antibodies, are glycoproteins that act as B-cell receptors against external or internal antigens such as viruses and bacteria. IGPs play a significant role in diverse cellular processes ranging from adhesion to cell recognition. Identifying IGPs in silico is faster and more cost-effective than wet-lab methods. Methods. In this study, we developed an intelligent theoretical deep learning framework, “IGPred-HDnet”, for the discrimination of IGPs and non-IGPs. Three types of promising descriptors, namely feature extraction based on graphical and statistical features (FEGS), amphiphilic pseudo-amino acid composition (Amp-PseAAC), and dipeptide composition (DPC), are used to extract the graphical, physicochemical, and sequential features. Next, the extracted attributes are evaluated through machine learning classifiers, i.e., decision tree (DT), support vector machine (SVM), k-nearest neighbour (KNN), and hierarchical deep network (HDnet). The proposed predictor IGPred-HDnet was trained and tested using 10-fold cross-validation and an independent test. Results and Conclusion. On the training and independent datasets ($D_{train}$ and $D_{test}$), IGPred-HDnet achieved an accuracy (ACC) of 98.00% and 99.10% and a Matthews correlation coefficient (MCC) of 0.958 and 0.980, respectively. The empirical outcomes demonstrate that the IGPred-HDnet model, using the novel FEGS feature and the HDnet algorithm, achieved superior predictions on both datasets compared with other existing computational models. We hope this research will provide great insights into the large-scale identification of IGPs and assist pharmaceutical companies in new drug design.


Introduction
Immunoglobulins are serum proteins in the human body. These proteins act as antibodies involved in various cellular processes such as cell adhesion, binding, and recognition. Immunoglobulins significantly boost the immune system by detecting dangerous macromolecules that have entered the body [1]. When foreign elements enter the body, the immune system detects the invader and then activates B lymphocytes to secrete immunoglobulins against the invading antigens. For instance, immunoglobulins can deactivate a toxin by altering its chemical structure. To provide a shield against bacterial infection, stabilin-2 can attach to both Gram-positive and Gram-negative bacteria.
Immunoglobulins are linked to the treatment of various diseases [2], such as autoimmune diseases, skin inflammation, and Behçet's disease [3,4]. In other words, intravenous immunoglobulin helps treat such diseases in people who suffer from muscle problems and systemic swelling in skin infections. The use of immunoglobulin for lupus erythematosus dermatosis in association with the treatment of Behçet's disease has great potential without any harmful impact [3,4]. Ref. [5] shows that a better understanding of immunoglobulins and immunological processes permits the development of improved drugs to cure infection. Despite the medical applications of immunoglobulin proteins, in-depth knowledge of their function is still developing.
Over the past years, immunoglobulin protein classification and characterization have become a hot topic in bioinformatics and computational biology. Wet-lab approaches such as X-ray crystallography and mass spectrometry are used to discover immunoglobulin proteins. However, such laboratory-based approaches are unfavourable due to their high cost and time consumption. In this regard, researchers have designed various machine learning-based methods to identify immunoglobulins from protein sequences. Efficient machine learning-based methods can quickly and accurately predict unannotated proteins from large databases. Machine learning techniques are applied in numerous areas of medicine, such as diagnostics. Clonal dynamics and relative frequencies have been utilized to develop an antibody clonal examination framework to explore certain antigen-specific human monoclonal antibodies [5][6][7]. In various fields of the healthcare system, immunological and biological applications, including infection control, immunization diagnostics, and B-cell detection, are of key significance [8,9]. The research community has reported numerous studies on the range of antigens that can be recognized by a specific antibody or by a group of antibodies, e.g., an antibody repertoire obtained by applying Rep-Seq [10]. This key observation led to another well-defined technique for tackling B-cell epitope detection, in which the binding target of a specific antibody is detected [11,12]. Such studies incorporate optical, electrochemical, and piezoelectric biosensors to measure total immunoglobulin levels, of which electrochemical biosensors are the most widely employed. Several immunoglobulin optical biosensors depend on surface plasmon resonance (SPR) detection in buffer solutions. These state-of-the-art technologies are useful for immunoglobulin studies; however, conducting biochemical studies is very expensive in terms of money and time.
To process large amounts of protein data accurately and quickly, there is a pressing need for a computational framework for immunoglobulins. The first step is to identify immunoglobulin proteins, which calls for a useful and inexpensive framework to predict them efficiently. Over the last decades, the research community has designed various machine learning-based frameworks for protein sequence analysis and classification [13][14][15][16][17]. In bioinformatics, predicting immunoglobulins involves transforming protein sequences into feature matrices to uncover the underlying characteristics of the proteins. The essential stages of protein prediction are feature representation, selection of key features based on their importance, and classification. Amino acid composition (AAC), dipeptides (Dip), and tripeptides are feature extraction techniques that produce n-gram feature representations, in which the occurrence frequencies of n-length peptides are used as the feature matrix [18][19][20].
Furthermore, another feature extraction method, pseudo-amino acid composition (PseAAC), is commonly used; it considers the physicochemical properties of residues [15,17,21,22,23]. To reduce the dimensionality of the protein representation, the notion of pseudo-K-tuples was combined with the idea of PseAAC [24,25] to design the pseudo K-tuple reduced amino acid composition (PseKRAAC) [26]. In [27,28], the classifier IGPred was developed by considering nine physicochemical properties of amino acids together with pseudo-amino acid composition. In Ref. [29], a support vector machine (SVM)-based predictor was developed to discriminate immunoglobulins from non-immunoglobulins. It used PseAAC with nine physicochemical characteristics of amino acids; the model was trained with cross-validation and achieved 96.3% accuracy. Although this performance is good, an efficient bioinformatics tool that predicts immunoglobulins with a lower error rate is still needed.
Combining various feature representations and multifaceted prediction methods may produce redundant information [30,31]. To deal with this problem, many studies have suggested feature selection algorithms that eliminate unnecessary information to enhance the performance of prediction methods. Maximum-Relevance-Maximum-Distance (MRMD) [32,33] and Analysis of Variance (ANOVA) [34] are typical feature selection approaches. MRMD has two parts. The first uses Pearson's correlation coefficient (PCC) to measure the relevance of the features in a subset. The second computes the redundancy among features using the Euclidean distance (ED), cosine distance (CD), and Tanimoto coefficient (TC). To obtain an optimal feature representation, [35][36][37] used principal component analysis (PCA) and misclassification error (MCE) for pentatricopeptide-repeat protein prediction and achieved 97.9% accuracy. Li et al. [33] used the above method to design a model for predicting anticancer peptide sequences with 19-dimensional attributes.
Although significant effort has been devoted to the prediction of IGPs, some shortcomings in feature-encoding schemes and learning models should be acknowledged. Firstly, existing methods lack feature learning algorithms that properly extract structured pattern information from protein sequences. Secondly, conventional machine learning classifiers alone are not accurate enough to discriminate IGPs from non-IGPs. Thirdly, the existing immunoglobulin predictors reported only cross-validation results on the training dataset and ignored external/independent test results. Independent test results are significant because they show the generalization power of the trained model.
To the best of our knowledge, IGPred-HDnet is the first deep learning-based predictor for identifying IGPs. IGPred-HDnet extracts numerical feature vectors from a given protein sequence using the feature descriptors FEGS (graphical features), Amp-PseAAC (physicochemical features), and DPC (sequential features) and feeds them to the hierarchical deep net (HDnet) model, which serves as the base classifier. The model opts for deep representations instead of manually extracted handcrafted features to classify IGPs. We validated the model exhaustively, and the results show that its overall prediction on both the training and testing datasets outperforms the existing state-of-the-art methods. The study provides great insights into the large-scale identification of IGPs, which pharmaceutical companies can use in novel drug design.

Materials and Methods
In the following subsections, we describe the stepwise approach to the classification of IGPs, which is shown in Figure 1. First, the dataset collection and preprocessing method are discussed. Next, the feature representation method is presented. Finally, the classification framework and model evaluation are discussed.

Dataset Construction and Preprocessing.
This subsection describes the dataset used for training and evaluating the designed framework. The dataset contains immunoglobulin sequences, located inside or outside the cell membrane, downloaded from the UniProt database. Several standard steps were applied to ensure the quality of the benchmark dataset. In the first stage, we eliminated the ambiguous residues, i.e., "B," "J," "O," "X," "U," and "Z," from the protein sequences to obtain standard amino acid sequences [38]. We also eliminated any sequence that is a fragment of another protein. In the second stage, we selected the protein sequences from human, mouse, and rat. In the last stage, we used the CD-HIT software with a cutoff value of 60% to reduce the bias from highly similar sequences, which can cause overfitted prediction results. Our dataset $D$ consists of 302 samples, with 110 positive ($D^{+}$) and 192 negative ($D^{-}$) samples, for training the model. Our independent dataset $indD$ contains 112 samples for evaluating the trained model, of which 40 are positive ($indD^{+}$) and 72 are negative ($indD^{-}$). Overall, the 150 positive and 264 negative samples are provided in Supplementary File S1 and Supplementary File S2, respectively.
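The first cleaning stage can be illustrated with a minimal Python sketch. It assumes that a sequence containing any of the listed ambiguous codes is dropped as a whole and that a "portion of other proteins" means an exact substring of another sequence; the text does not state either detail. The function names and the toy records are hypothetical, and the CD-HIT redundancy step is not reproduced here.

```python
# Residue codes the text lists as ambiguous.
AMBIGUOUS = set("BJOXUZ")


def has_ambiguous(seq):
    """Return True if the sequence contains any ambiguous residue code."""
    return any(residue in AMBIGUOUS for residue in seq.upper())


def filter_sequences(records):
    """Keep only records free of ambiguous codes (assumption: drop the whole sequence)."""
    return {name: seq for name, seq in records.items() if not has_ambiguous(seq)}


def drop_fragments(records):
    """Drop sequences that occur as an exact substring of a different sequence."""
    seqs = list(records.values())
    return {
        name: seq
        for name, seq in records.items()
        if not any(seq != other and seq in other for other in seqs)
    }


# Toy records, for illustration only (not from the paper's dataset).
toy = {
    "seq1": "MKVLAAGIVACDE",
    "seq2": "MKXLAAGIV",   # contains the ambiguous code X
    "seq3": "LAAGIV",      # fragment of seq1
}
clean = drop_fragments(filter_sequences(toy))
print(sorted(clean))
```

Sequences that survive both filters would then be passed to a redundancy-reduction tool such as CD-HIT with the 60% cutoff mentioned above.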

Existing Feature Extraction Schemes.
Designing a computational framework to predict immunoglobulins involves a series of steps. Among them, feature extraction is a challenging and essential step that converts a biological sequence into numerical values [39]. Conventional learning models, including K-nearest neighbour (KNN), random forest (RF) [40,41], and support vector machine (SVM) [42], operate on fixed-length numerical vectors and cannot handle variable-length protein sequences; a feature representation algorithm tackles this problem by extracting a fixed-length feature vector from the variable-length sequences [43][44][45]. Several researchers have used different feature encoding schemes [46], as shown in Figure 2; however, none of them used the proposed method for extracting vital pattern information from immunoglobulins. A detailed description is given in Section 2.3.

Feature Extraction Based on Graphical and Statistical Features (FEGS).
Herein, we adopt a novel feature representation method named Feature Extraction based on Graphical and Statistical features (FEGS) [47] for immunoglobulin sequences, as shown in Figure 3. The proposed deep neural network is not novel; however, the extraction of features through this method is novel. Extracting hidden pattern information through graphs differs from other sequence-based feature descriptors. The main shortcoming of traditional methods is the loss of sequence order information. For example, amino acid composition and reduced amino acid alphabets cannot retain the protein's globally correlated properties. Furthermore, manual feature extraction requires extensive effort and may still be insufficient. Such handcrafted features are not as powerful at discriminating biological sequences as deep representations, as shown in [15]. The FEGS algorithm was proposed to tackle this issue by representing proteins as three-dimensional curves. FEGS works as follows. First, it builds a graphical representation of the primary protein sequence using right circular cones in 3D space, extending the notion of 3D protein paths. Second, it uses the physicochemical properties of amino acids and the statistical attributes of amino acid pairs to construct many such cones in 3D space. Finally, a 578-dimensional vector is generated for each protein sequence by combining the graphical features with the mono-amino acid and dipeptide compositions.
Initially, the protein sequences are provided in FASTA format as input; FEGS then eliminates redundant physicochemical indices with identical values and generates 158 space curves for each protein sequence.

Generation of 3D Graphical Curves for Immunoglobulin Sequences.
In this method, the protein sequences are provided in FASTA format as input; then, according to their physicochemical indices, the 20 amino acids are first mapped to 20 points in 3D space. In the second step, the graphical curve of an immunoglobulin sequence is generated by extending a 3D protein path based on a right circular cone.
(1) Preparation of the 20 Amino Acids and the 400 Amino Acid Pairs. Physicochemical properties (PCP) of amino acids (AAs) play a vital role in analyzing and characterizing protein function. We arranged the 20 AAs in ascending order of their PCP values. Then, we placed them on the circumference of the base of a right circular cone with a height of 1 by the following formula: In the above equation, $A_i$ denotes the 20 amino acids, whereas all 400 amino acid pairs are mapped to the base of the right circular cone via the formula below: $A_i A_j$ represents each of the 400 amino acid pairs.
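As an illustration of the first placement step, the following Python sketch sorts the 20 amino acids by one property and places them on a circle at height 1. Several details are assumptions not confirmed by the text: equal angular spacing, a unit radius, and the base lying in the plane z = 1 with the apex at the origin. The Kyte-Doolittle hydropathy scale is used only as an example property and is not claimed to be one of the 158 indices; the function name is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values, used here purely as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def place_on_cone_base(prop):
    """Map each amino acid to a point on a unit circle at height z = 1.

    Amino acids are sorted in ascending order of the property value and
    spaced at equal angles (an assumption made for this sketch).
    """
    order = sorted(prop, key=lambda aa: prop[aa])  # stable sort keeps ties in dict order
    n = len(order)
    coords = {}
    for i, aa in enumerate(order, start=1):
        angle = 2.0 * math.pi * i / n
        coords[aa] = (math.cos(angle), math.sin(angle), 1.0)
    return coords


coords = place_on_cone_base(HYDROPATHY)
for aa in ("R", "I"):
    print(aa, tuple(round(c, 3) for c in coords[aa]))
```

Repeating this for each selected property would give one cone per property, which is how the text arrives at one curve per property for every sequence.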
Consider a protein sequence $S$ with $N$ AA residues, $S = s_1 s_2 \ldots s_N$. Constructing the 3D graph for the protein sequence is quite challenging. The 3D graphical curve is generated by extending a 3D protein path based on a right circular cone as follows. Starting from the origin $P_0 = (0, 0, 0)$, the path extends to the point $P_1 = (x_1, y_1, z_1)$ in 3D space, corresponding to the first AA $s_1$, then to the second point $P_2 = (x_2, y_2, z_2)$, corresponding to the second AA $s_2$, and so on until the path ends at the last AA $s_N$. This process yields the path $P$, which corresponds to a 3D graphical curve of the immunoglobulin sequence $S$, where $P_i = (x_i, y_i, z_i)$ corresponds to the $i$-th amino acid $s_i$, and the point coordinates $x_i$, $y_i$, and $z_i$ are described in the following formulas: In the above equation, $\Psi(s_0) = (0, 0, 0)$, and $f_{A_1 A_2}$ is the count of the amino acid pair $A_1 A_2$. Each of the 158 selected physicochemical properties is associated with its own right circular cone; in this way, we obtain 158 different 3D graphical curves for every immunoglobulin sequence, one per physicochemical property.

Numerical Features of Protein Sequences.
Another challenging task is to transform the generated graphical curves into numerical feature vectors for the similarity analysis of immunoglobulin samples. Here, for each curve, the L/L matrix, denoted by $M$, is calculated. Its off-diagonal elements $M_{i,j}$ ($i \neq j$) are defined as the ratio of the Euclidean distance between $P_i$ and $P_j$ to the sum of the geometric lengths of the edges between $P_i$ and $P_j$ along the curve, while its diagonal elements are equal to zero. Subsequently, the 158 curves are converted into a 158-dimensional graphical feature vector, as described below:
Many other feature extraction techniques exist, among which AAC and DPC are commonly used in protein sequence analysis. AAC counts the frequency of each AA in a given sequence, normalized by the sequence length, and yields 20 fixed-length features as formulated below: In the above equation, $f$ represents the number of occurrences of an AA in the protein sequence. DPC counts the number of occurrences of the 400 AA pairs in the given protein sequence and yields 400 fixed-length features as formulated below: where $f$ represents the number of occurrences of the $j$-th AA pair, i.e., one of {AA, AC, AD, AE, ..., YY}, in the protein sequence. The statistical features, i.e., the AAC vector $V_a$ and the DPC vector $V_d$, are merged with the graphical features $V_g$ to obtain a 578-dimensional feature vector for the protein sequence $S$. In general, when a dataset containing $N$ immunoglobulin sequences is given to FEGS, we obtain an $N \times 578$ feature representation matrix in which every row is the feature vector of one sequence.
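The statistical part of the feature vector can be sketched in a few lines of Python. The sketch assumes that AAC divides each count by the sequence length, as the text states, and that DPC divides each pair count by the number of overlapping pairs, which is a common convention but is not stated in the text. The concatenation order and all function names are hypothetical, and the 158 graphical features are represented by a placeholder list.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(seq):
    """20 amino acid frequencies, normalized by sequence length."""
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def dpc(seq):
    """400 dipeptide frequencies over overlapping pairs (normalization is an assumption)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = len(seq) - 1
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


def merge_features(graphical, seq):
    """Concatenate graphical (V_g), AAC (V_a), and DPC (V_d) features."""
    return list(graphical) + aac(seq) + dpc(seq)


toy_seq = "ACAC"                  # toy sequence, for illustration only
placeholder_vg = [0.0] * 158      # stands in for the 158 graphical features
vector = merge_features(placeholder_vg, toy_seq)
print(len(vector))
```

The length check reflects the arithmetic in the text: 158 graphical features plus 20 AAC and 400 DPC features give 578 dimensions.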

The Proposed Model Workflow
We developed a robust immunoglobulin predictor called Immunoglobulin Proteins Prediction Hierarchical Deep net (IGPred-HDnet). Figure 4 illustrates the flow of the proposed framework and shows its main stages: data collection, data distribution, feature representation computation through FEGS, classification through HDnet, and evaluation. In the feature representation stage, a novel feature encoding method is used to extract valuable feature representations from immunoglobulin sequences.

Hierarchical Deep Net Model (HDnet).
The hierarchical deep net (HDnet) model is an ensemble-based model inspired by [48], which is an alternative to a deep neural network (DNN) for learning high-level feature representations with fewer resources and less effort. In contrast, a DNN uses a complex architecture, i.e., forward and backward propagation algorithms, to learn hidden information. In developing an HDnet classifier, it is crucial to determine the learning algorithms employed in each layer. In our proposed model, we use a combination of Extreme Gradient Boosting (XGBoost) [49,50], random forest (RF) [51][52][53], and extremely randomized trees (ERT) [54,55] classifiers, which achieved outstanding performance, and feed it with the previously computed 578-dimensional vector. HDnet is based on a deep ensemble method that cascades conventional classifiers, for example, RF, ERT, and XGBoost. Compared to a DNN, HDnet uses decision trees instead of neural network (NN) models for feature representation learning in each layer. Figure 5 shows the generic representation of HDnet: if there are multiple feature vectors from multiple encoding schemes, they are concatenated at level N. These feature vectors are deep representations learnt at different layers, similar to other deep neural networks. Owing to its hierarchical nature, the HDnet model makes the training process more robust and is more appropriate for training on a limited number of protein samples. A DNN involves many parameters that need tuning during training, whereas the hyperparameters of our proposed model are easy to tune.

We set the boosting parameter value $k = 20$ for the XGBoost classifier. For RF and ERT, the number of decision trees is also set to 20, and node splits are chosen from randomly selected features. In our model, every layer is an ensemble of diverse learners (e.g., six XGBoost, six RF, and six ERT) that accept the feature representation processed by the classification models of the previous layer. The output of the previous layer is the input to the subsequent layer. To produce the enhanced feature representation related to the multivariate class vectors, the outputs are integrated, stacked, and summed, and the highest probability score gives the final prediction. Training is terminated when no further improvement in performance is observed. Figure 5 reveals the layer-by-layer framework of the HDnet.
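The layer-to-layer data flow of such a cascade can be sketched in plain Python. This is only a structural illustration under stated assumptions: the toy learners below are hypothetical stand-ins for trained XGBoost, RF, and ERT models, each layer is assumed to receive the original features concatenated with the class-probability vectors of the previous layer, and the final class is taken from the averaged probabilities of the last layer. None of these details is confirmed by the text beyond the general cascade idea.

```python
def toy_learner(weight):
    """Hypothetical stand-in for a trained classifier; returns [p_non_igp, p_igp]."""
    def predict(features):
        score = weight * sum(features) / len(features)
        p_igp = min(max(score, 0.0), 1.0)
        return [1.0 - p_igp, p_igp]
    return predict


def cascade_predict(features, layers):
    """Pass a feature vector through a cascade of ensembles.

    Each layer sees the original features plus the class vectors produced
    by the previous layer (an assumption made for this sketch).
    """
    current = list(features)
    class_vectors = []
    for learners in layers:
        class_vectors = [learner(current) for learner in learners]
        current = list(features) + [p for vec in class_vectors for p in vec]
    n_classes = len(class_vectors[0])
    avg = [sum(vec[c] for vec in class_vectors) / len(class_vectors)
           for c in range(n_classes)]
    return avg.index(max(avg)), avg


def layers_to_keep(validation_scores):
    """Stop growing the cascade at the first layer that brings no improvement."""
    keep = 1
    for i in range(1, len(validation_scores)):
        if validation_scores[i] <= validation_scores[i - 1]:
            break
        keep = i + 1
    return keep


layers = [[toy_learner(1.0), toy_learner(0.5)], [toy_learner(1.0)]]
label, avg = cascade_predict([0.8, 0.6], layers)
print(label, layers_to_keep([0.90, 0.95, 0.95, 0.97]))
```

In a real implementation, each stand-in would be replaced by a fitted tree ensemble, and the validation scores would come from cross-validation at each layer.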

Performance Evaluation
In this research, we utilized four performance evaluation measures, i.e., accuracy (ACC), specificity (SP), sensitivity (SN), and Matthews correlation coefficient (MCC), to determine the success rate of our proposed prediction models, defined as

$$
\begin{aligned}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN},\\
\mathrm{SN} &= \frac{TP}{TP + FN},\\
\mathrm{SP} &= \frac{TN}{TN + FP},\\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\end{aligned}
$$

In the above equations, TP represents true IGPs that are correctly predicted as positive instances, whereas TN represents true non-IGPs that are correctly classified as negative samples. FP indicates non-IGPs that the model incorrectly predicts as immunoglobulins, and FN indicates IGPs that the model incorrectly predicts as non-immunoglobulins.
The above performance measures, including the MCC, depend on the decision threshold; the MCC delivers a comprehensive evaluation of binary classification. Furthermore, to describe the model performance more broadly, we utilized the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), a threshold-independent measure that serves as a further essential assessment of the model.

Proposed Framework Evaluation
In machine learning (ML), model performance is usually assessed via cross-validation (CV). Three tests are commonly used in the research community to determine the discriminatory power of a designed framework: the K-fold (also called subsampling) test, the Jackknife (i.e., leave-one-out) test, and the independent test [56,57]. The Jackknife test provides exceptional and encouraging results [58]; however, its main drawback is its computational cost due to the large number of calculations [59]. To overcome this weakness of the Jackknife test and improve the generalization power, we used the K-fold CV test to train our model and assess its performance [60]. In this method, we randomly divided the training data into K folds (subsets), of which K − 1 are used to train the proposed model and the remaining one is used to test it [61]. The results obtained over the folds are then averaged to give the final estimate. We set the value of K to 10 after conducting various experiments.
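The K-fold procedure described above can be sketched with the Python standard library alone. The sketch assumes a plain random split without class stratification, since the text does not say whether folds were stratified; the function names, the fixed seed, and the placeholder `evaluate` callback are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for a random K-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average a per-fold score returned by evaluate(train, test)."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# 302 matches the size of the training dataset reported in the text.
splits = list(k_fold_indices(302, k=10))
print([len(test) for _, test in splits])
```

In practice, `evaluate` would train the classifier on the `train` indices and return a score such as ACC on the `test` indices.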

Predictive Performance of Hypothesis Learners Using Various Feature Encoding Schemes on the Training Dataset $D_{train}$.
In this section, we experimentally determine the prediction performance of various classifiers, i.e., KNN [62], DT [63], SVM [46,64], and HDnet, using various descriptors, i.e., APAAC (physicochemical features), DPC (sequential features), and FEGS (graphical features), as shown in Figure 6. Each learning engine is evaluated by conducting a ten-fold cross-validation test, and the results are given in Table 1. First, the HDnet model consistently produced the best outcomes among the classification algorithms for all feature encoding schemes. The main reason is the high learning potential of a deep neural network compared with conventional classifiers. The internal structure of the HDnet classifier is based on decision trees, which enables the model to better predict from the extracted features [65]. Further, it is evident in the literature that deeper networks have more learning potential than conventional neural networks [15,66,67]. Secondly, among the feature representation approaches, FEGS (graphical features) produced better results across all hypothesis learners (classifiers) than the other feature vectors, DPC and APAAC. The underlying reason for the high prediction rate of FEGS is that it extracts conserved local and global graphical, physicochemical, and statistical attributes from a protein sequence. As in Figure 1, the effect of the extracted features can be visualized through t-distributed stochastic neighbour embedding (t-SNE). The red colour represents the IGP class, and the green colour represents the non-IGP class. Highly correlated features, such as DPC and APAAC, do not support correct prediction of immunoglobulins. In contrast, the novel FEGS features are less correlated, enabling the classifiers to achieve high performance.

Predictive Performance of Hypothesis Learners Using Various Feature Encoding Schemes on the Testing Dataset $D_{test}$.
In this subsection, we examine the success rates of our model via an independent test to show its generalization power. We ensured that the samples in the independent test set $D_{test}$ were unseen, i.e., none of them was used in training the model. Table 1 depicts the prediction outcomes of all classifiers using the APAAC, DPC, and FEGS feature methods. Comparative analysis reveals that our proposed learning model HDnet using the novel FEGS feature achieved the best performance (Table 2 and Figure 8).
The underlying reason for the high prediction performance is the extraction of graphical, physicochemical, and sequence-based attributes. Also, the hierarchical structure of the HDnet classifier enables better prediction of the IGP samples from the extracted attributes [65].

Conclusion and Future Work
IGPs are a crucial constituent of the immune system. A deep understanding of IGPs can provide useful hints in drug discovery for disease treatment. Thus, the objective of this research was to construct a novel sequence-based computational method for predicting and analyzing IGPs. The proposed theoretical model "IGPred-HDnet" is superior to other advanced immunoglobulin predictors for several reasons. Firstly, we employed the innovative graphical algorithm FEGS to capture structured information buried in the protein sample. The structural features produced better results than the other feature schemes. Secondly, we implemented a deep learning model called HDnet for the first time as a learning model for recognizing IGPs.
Although the model's overall performance has improved, gaps remain for future work. For example, several previous publications, such as Tang et al. [27], established public web servers; a web server would likewise enrich the applicability of the proposed model. Also, using novel feature selection algorithms is vital to avoid overfitting and improve the generalization power of the trained model. We hope that the proposed IGPred-HDnet will become a useful tool for large-scale IGP characterization in particular and other protein problems in general.

Data Availability
The dataset analyzed in this study can be found in the supplementary files.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Acknowledgments
The researchers would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.

Supplementary Materials
Supplementary File S1 contains the positive samples (immunoglobulin sequences). Supplementary File S2 contains the negative samples (non-immunoglobulin sequences).